Change-Point Detection in a High-Dimensional Multinomial Sequence Based on Mutual Information

Time-series data often have an abrupt structure change at an unknown location. This paper proposes a new statistic to test the existence of a change-point in a multinomial sequence, where the number of categories is comparable with the sample size as it tends to infinity. To construct this statistic, the pre-classification is implemented first; then, it is given based on the mutual information between the data and the locations from the pre-classification. Note that this statistic can also be used to estimate the position of the change-point. Under certain conditions, the proposed statistic is asymptotically normally distributed under the null hypothesis and consistent under the alternative hypothesis. Simulation results show the high power of the test based on the proposed statistic and the high accuracy of the estimate. The proposed method is also illustrated with a real example of physical examination data.


Introduction
The change-point problem was first proposed by Page [1]. It considers a model in which the distribution of the observed data changes abruptly at some point in time, which is common in biology [2], finance [3], literature [4] and epidemiology [5]. Change-point detection can be employed as a tool in time series segmentation. A typical reference in the field of study is [6]. Once a change-point is detected in a data sequence, it is used to split the data sequence into two segments so that both segments are modeled separately. On the other hand, from a practical point of view, behavior and policies can be adjusted based on changes in events of interest. So, it is very important to perform change-point detection.
There are mainly two problems in the change-point model: checking the existence of change points and estimating the positions of these change points. These issues have been studied in substantial literature. For example, see Sen and Srivastava [7] for the mean change in a normal distribution and Worsley [8] for a change in an exponential family using the maximum likelihood ratio method. Others include Bai [9] for the least squares estimate of mean shift in linear processes, Vexler [10] for the change-point problem in a linear regression model and Gombay [11] for the change-point in autoregressive time series, etc. See [12][13][14] for details.
A study has shown that most work on change-point problems has been done for continuous data [14]. In real life, however, many data are observed on a discrete scale. Common discrete distributions include binomial, multinomial and Poisson distributions. In this article, we consider the change-point problem in a multinomial sequence, which originated from the transcription of the Gospels [15]. The Lindisfarne Gospels were divided into several sections, assuming that only one author contributed to the writing of any section and that the sections written by any one author were continuous. The goal was to test whether a single author wrote the Gospels. The data may be the frequency of vocabulary or grammar used by the author of each section. In general, suppose that X 1 = (X 11 , · · · , X 1m ), · · · , X K = (X K1 , · · · , X Km ) are K independent multinomial variables with parameter (n 1 , p 1 ), · · · , (n K , p K ), where p i = (p i1 , · · · , p im ), ∑ m j=1 p ij = 1 , i = 1, · · · , K. On the ith point, there are n i experiments with m outcomes, and X i records the frequencies of m outcomes. We want to test H 0 : p 1 = · · · = p K v.s. H 1 : p 1 = · · · = p k * = p k * +1 = · · · = p K (1) where k * is the true change-point, 1 < k * < K . If H 0 is rejected, we further estimate k * .
To solve this problem, Wolfe and Chen [16] proposed several statistics based on the cumulative sum (CUSUM) method. Horváth and Serbinowska [17] used the maximum likelihood ratio and maximum chi-square statistic to test the existence of change points and derived their transformed limit distribution. Batsidis and Horváth [18] extended it and proposed a family of phi-divergence tests that involves broad statistics. Riba and Ginebra [19] performed a graphical exploration of a sequence of polynomial observations and found a break point. Note that they all assumed that the number of categories m is fixed.
In recent years, the rise of big data has made the high-dimensional change-point problem more important. Thus, it becomes urgent to consider high-dimensional multinomial data as the categories of one thing in life can be quite large, such as the type of stores selling certain items on a shopping platform and the type of illness of patients in an outpatient clinic during a day. In this paper, we consider problem (1) with m tending to infinity. Recently, Wang et al. [20] proposed a procedure based on Pearson's Chi-square test under the above scenario. Their idea is to pre-divide categories into two based on their probability magnitudes and to use the original and modified Pearson's Chi-square statistic for large and small categories, respectively. This pre-classification can balance sparse and dense signals, resulting in good statistical performance. So, here, we use the pre-classification idea to construct a test statistic for problem (1) with m tending to infinity.
Another tool used in this article is based on information entropy. Entropy, originally a concept in statistical physics, was introduced into information theory by Shannon [21]. It has been widely applied in change-point problems. Unakafov and Keller [22] used ordinal mode conditional entropy to detect change points. Ma and Sofronov [23] proposed a cross-entropy algorithm to estimate the number and positions of change points. Vexler and Gurevic [24] applied empirical likelihood method to change-point detection, in which the essence of empirical likelihood estimation is a density-based entropy estimation. Mutual information, denoted by MI, computed as the difference between entropy and conditional entropy, is popular in deep learning, for example, see [25,26]. In the area of machine learning, MI is similar to information gain, which is often used as a measure of the goodness of a step in an algorithm, such as the selection of node splitting in a tree. Therefore, MI is naturally used as a metric in event detection problems. Relevant works include [27,28], etc. We utilize the MI between data and their position in this paper, given that a large value of MI means a high probability that a change point occurs.
In this paper, we consider the offline change-point problem. We propose a test statistic based on the mutual information for the at most one change-point (AMOC) problem (1) with m tending to infinity as the sample size tends to infinity. We adopt the pre-classification idea in [20] here. The optimal change-point position can also be estimated by MI. We show that the proposed statistic has an asymptotic normal distribution under the null distribution, and the power of the test converges to one under the alternative hypothesis. Meanwhile, we point out the relationship between MI and the likelihood ratio. In fact, the proposed statistic is based on the likelihood ratio method. As is widely acknowledged, although there is no uniformly most powerful test for change-point detection in general [29,30], the test based on the likelihood ratio structure has a high power [31]. Simulation studies demonstrate the excellent power of the test based on the proposed statistic as well as the high accuracy of the estimation. The innovations we have made in this article are that we replace the Pearson Chi-square statistic in Wang et al. [20] with mutual information and achieve better performance in terms of power and accuracy compared to their method.
The remaining structure of this paper is as follows. In Section 2, we present the proposed test statistic and the estimation method of a change point. In Section 3, we provide simulation results. In Section 4, we illustrate the method with an example based on physical examination data. In Section 5, we conclude the paper with some remarks. The proofs of the theorems are given in Appendix A.

Entropy and Mutual Information
We first briefly introduce some concepts about entropy and mutual information.

Definition 1.
Suppose that x 1 , · · · , x u are the possible values taken by a random variable X, where u can be infinity. Let P X (x i ) be the probability that X = x i . The Shannon entropy of X is defined as (2) Definition 2. Let Y be a random variable that takes values in {y 1 , · · ·, y v }, where v can be infinity. The conditional entropy of X given Y is defined as where P X,Y , P X|Y are the joint probabilities of X and Y and the conditional probability of X given Y, respectively.

Definition 3.
Assume that X and Y are the same as in Definitions 1 and 2. The mutual information (MI) of X relative to Y is defined as The entropy value is larger when the data distribution is more symmetric. On the contrary, when the data are skewed, they have a small entropy [32]. The conditional entropy measures how much uncertainty is eliminated in X by observing Y. Obviously, mutual information can be written as the difference between entropy and conditional entropy, that is, MI(X; Y) = H(X) − H(X|Y). It represents the average amount of information about X that can be gained or the amount of reduction of uncertainty in X by observing Y. MI(X; Y) ≥ 0, and it becomes zero if X and Y are independent of each other.

Pre-Classification
For multinomial data, when the number of categories m is large, it is sometimes not realistic to treat all categories equally. For example, of all the cities in China, only a few of them account for half of the economy, which means that the rest of the cities have a small average share. The well-known Pareto principle [33] that 20% of the population owns 80% of the wealth in society also illustrates this phenomenon. Therefore, it is reasonable to classify the categories with different orders of magnitude.
Similar to Wang et al. [20], let B 0 be a subset of {1, . . . , m} such that max where the superscript c stands for the complement operator. Assume that min j∈A 0 q 0j a m > ε and min j∈A 1 q 1j a m > ε for some Then, m categories are divided into large and small orders of magnitude by a m denoted by A and B. A change from q 0 to q 1 = q 0 might occur either in A or B.
Let X iA be the component of X i in A for i = 1, . . . , K and q 0A be the component of q 0 in A. Let X iB and q 0B be similarly defined. Then, the marginal distributions of X iA and X iB under the null assumption are and In the next subsection, we construct a statistic built on the marginal distributions (5) and (6).
Here are some additional notations. Denote i=k+1 n i as the number of experiments in total, before and after time k, and as the number of successful trials in total, before and after be the corresponding frequencies.
For the data in A, let be the number of successful trials in total, before and after time k.
∑ j∈B X ij as the sum of successful trials in B of total, before, and after k.
be the corresponding frequencies. Subscript S denotes the sum of frequencies. Similarly, we define Z B , Z 0kB , Z 1kB , Z AS , Z 0kAS , Z 1kAS ,q Bj ,q 0kBj ,q 1kBj , j ∈ B,q AS ,q 0kAS ,q 1kAS . We illustrate some of the above notations in Table 1 in a more structured fashion. Table 1. Explanation of some notations.

Total
Before k After k

Test Statistic
We use MI between the data X = (X 1 , . . . , X K ) and the location of the data to construct the statistic. For the data in A, the entropy is The entropies in A before and after k are H 0kA = −q 0kBS logq 0kBS − ∑ j∈Aq 0kAj logq 0kAj (8) and respectively.
Denote Y k = I{the location of X i is before k} as the indicator function of the position of a sample relative to k. Note that, given the observations, P(Y k = 1) = N 0k N by the independence. By Section 2.1, the MI between X and Y k in A is where where H B , H 0kB and H 1kB are defined similarly as in (7)- (9).
The uncertainty of X given Y k would reach the largest reduction if k is at the true break point k * ; hence, either MI k * A or MI k * B should be large. On the contrary, if the sequence is stable, the value of MI k should be small for any k ∈ {1, . . . , K}.
Since A and B are unknown, in light of Wang et al. [20], we useÂ = {j :q j a m > Cε} to estimate A. Here, C > 0 is some constant. As shown in [20],Â is a consistent estimator of A if a m satisfies certain assumptions. LetB =Â c . Construct the test statistic for (1). Summation and maximization are conducted respectively for the MI ofÂ andB in G m,Â . The first term in G m,Â is the weighted log-likelihood ratio estimate, as pointed out after Lemma 1. The second term in G m,Â is based on the maximum norm of MI. It is widely acknowledged that the max-norm test is more suitable for sparse and strong signals, see [34,35]. r m is a threshold forÂ, which ensures that the second term in G m,Â converges to zero under H 0 . e m is a large number. Note that the statistic in [20] is based on the Pearson Chi-square statistic. Since in reality, the frequencies of small categories might be zeros, the Pearson Chi-square statistic forB is hence modified. The statistic presented here does not need to take into account the fact that a frequency may be zero, since by the definition of entropy, −p log p = 0 if p = 0. In order to study the properties of G m,Â better, we first give a lemma about MI .
. Then, 2NMI kA = L kA . It is also true by replacing A with B in all the subscripts, that is, 2NMI kB = L kB .
Note that L kA and L kB in Lemma 1 are estimations of minus two log likelihood ratios for data in A and B when the change-point is at k. Therefore, the problem based on MI can be transformed into the problem based on likelihood ratios.
By Lemma 1, the second term in (11) is e m I( max k=1,...,K L kÂ > r m ), and hence the existing limit theorems on likelihood ratios can be applied to it directly. The first term in (11) is N 2 L kB , which has the form of a weighted log likelihood ratio estimation. In Appendix A, we show that it is only an infinitesimal quantity away from some CUSUM statistic [36] using Taylor expansion and then prove the asymptotic distribution of G m,Â from related conclusions.
The sum of L kB without weighting, ∑ K k=1 L kB , is closely related to the Shiryayev-Roberts procedure [37,38]. It uses ∑ K k=1 Λ k as a statistic, where Λ k is the likelihood ratio when the change point is at k. It is widely applied to determine the best stopping criterion in sequential change-point monitoring (see, e.g., [39]). However, replacing unknown parameters in ∑ K k=1 Λ k with their maximum likelihood estimation, which leads to ∑ K k=1 L kB in this paper, would result in a complex asymptotic analysis [40]. So, here we use the weighted version 1 Theorem 1 shows that G m,Â is asymptotically normally distributed under the null hypothesis. The condition (i) in Theorem 1 ensures the consistency ofÂ, which was also assumed in Theorem 1 of [20]. The condition (ii) in Theorem 1 requires the threshold r m to be large enough in order to guarantee that e m I( max k=1,...,K 2NMI kÂ > r m ) converges to zero with probability one under the null hypothesis. Condition (iii) means that every n i is much less than N. Next, we focus on the properties of the statistic under the alternative hypothesis.
Theorem 2 establishes the consistency of the test under certain conditions when the probability in A or B changes. Condition (i) in Theorem 2 means that e m tends to infinity at a certain rate. It aims to ensure that G m,Â tends to infinity when the parameters in A change. Condition (ii) requires comparable sample sizes before and after the change point. The proofs of Theorem 1 and Theorem 2 are provided in Appendix A.
Once H 0 is rejected, we further use MI to estimate k * . If max that the power of the new statistic increases rapidly as the difference between the alternative hypothesis and the null hypothesis increases. At the same time, the precision ofk using pre-classification is also satisfactory.

Simulation
We conduct simulation experiments to assess the performance of the test procedures in empirical size, power and estimation in finite samples. All results are based on 1000 replications. We use R to obtain simulation results. The necessary R code is given in Appendix B.
To analyze the empirical size, we simulate multinomial data with parameter q 0 = ( ω d 1 d , 1−ω m−d 1 m−d ) under the null hypothesis without break with reference to [20]. The first d probabilities are much greater than those of the latter. Hence, in reality,Â can be chosen as {1, ......,d}. Following [20], we used = arg max i=1,...,m−1 are the sorted values ofq j . We consider different situations with the sample size K arranged from 50 to 500, and let m = K in each situation. For simplicity, we fix n i = 100, i = 1, . . . , K.
For the formula of G m,Â , we choose e m = m, r m = (2 log log N + d 2 log log log N) 2 according to the conditions in the above section. The simulation results with various combinations of (ω, d) are reported in Table 2. We observe that the empirical size of the test is 4.5-6.7%, which is thus around the nominal 5% level in different situations. Here, we show the case of ω ≤ 0.5. We also performed simulations for ω > 0.5 and found empirical values slightly higher than 5% (data not shown). To evaluate the power of the test, the alternative hypotheses stipulate a single break in the data sequence. We first consider parameters of two forms: ω is the proportion in A, 0 < ω < 1. 1 d denotes a d-dimensional vector with all components equal to 1. s represents the shift size. We consider different values of s when evaluating power and accuracy to better observe changes in efficiency and accuracy as the gap between the alternative and null hypotheses increases. % is the mod operation. The two alternative hypotheses assume that the change point is located in A and B, respectively. We consider k * = 0.2K and 0.5K to capture breaks in the beginning and middle of a sample. For comparison, we use two competitors: • The weighted maximum likelihood ratio statistic L = max • The statistic Q = ∑ K k=1 ∑ j∈B (L kj − L kj (0) ) + e m I( max k=1,...,K max j∈Â R kj > r m ) in [20], in which 2 ) and R kj = L kĵ q j , e m = K 4 3 and r m = log K log K in the simulations.
The results are summarized in Figure 1 for level α = 0.05. The size of L is on the high side, as seen from the curve at small s in Figure 1. The new test is very powerful, as evidenced by the rapid rate of convergence to 1 when s increases. In most cases, the empirical power of G m,Â is larger than the other two for alternative hypothesis (i). For the alternative hypothesis (ii), the three statistics perform equally well. These results further show that our test has higher power to detect a change located in the middle of the sample than in the beginning while the power is also still high.  L is the weighted maximum likelihood ratio statistic in [17]. Q is the statistic in [20]. ω = 0.3, d = 5.
We also briefly investigate how well the change-point location k * is approximated by the estimatork . We choose k * = 0.2K and 0.5K as the change-point location. In Tables 3 and 4, we report the mean and standard deviation of the absolute errors |k − k * | for the different choices of s and m under the alternative hypothesis (i) or (ii), respectively.
We compare our estimate with the maximum likelihood ratio estimatek L = arg max k=1,...,KΛ k and k Q in [20].
The corresponding absolute errors in Table 3 and Table 4 underscore the considerable precision ofk, which improves when s is increased 0.3 from to 0.8. For the alternative (i), in almost all situations,k is better than the other two competitors. Small changes (for example, s = 0.3) are found with greater difficulty by usingk L andk Q , while the precision ofk remains high. For the alternative hypothesis (ii),k andk L have similar performance, and they are both slightly better thank Q . Alternative Hypothesis (i): Assume that the large probability changes while the small probability remains the same. Alternative Hypothesis (ii): Assume that the small probability changes while the large probability remains the same. Under alternative hypothesis (i), our method has better performance than the other two methods, probably because entropy as a non-linear function can increase the difference between frequencies, and it is more pronounced when the difference is small (e.g., s = 0.3). Table 3. Mean and standard deviation (in parentheses) of |k − k * |, |k L − k * | and |k Q − k * | under the alternative hypothesis (i) or (ii) with ω = 0.3, d = 5, and k * = 0.5K.   Finally, we simulate the power and estimation precision for alternative hypothesis (iii): where parameters in A and B change simultaneously, which was not mentioned in [20]. We compare our statistic andk with Q andk Q in this case. The results are displayed in Figure 2, Tables 5 and 6, from which we see that the power of G m,Â is slightly better than that of Q, and the precision ofk is obviously higher than that ofk Q .  Table 5. Mean and standard deviation (in parentheses) of |k − k * | and |k Q − k * | under alternative hypothesis (iii) with ω = 0.3, d = 5 and k * = 0.5K.

Example
In this section, we use a data set to address the applicability of our method. The data concern the medical examination results of people working in Hefei's financial sector (including banks and insurance companies) from 27 September 2017 to 25 August 2021, which includes each person's age, gender, the date of examination and the disease detected. From the perspective of health analysis and disease prevention, it is thus important to understand the diseases in terms of how often they are detected.
Our goal is to test whether the proportion of people who have been diagnosed with some diseases change over time. After removing gender-specific diseases, we finally choose 210 diseases. Because in some weeks there is no person to have the examination, we eliminate those weeks and finally keep 173 weeks. Let X i be a 210-dimension vector with each component indicating the frequency of a certain disease detected during the i th time period. Then, there are K = 173 vectors X 1 , . . . , X 173 of dimension m = 210 with N = 16596 outcomes. Figure 3 shows the numbers of the top 30 diseases detected. The weekly sample size n i s are provided in Figure 4. We find from Figure

Conclusions
This paper develops a change-point test based on MI for multinomial data when the number of categories is comparable to the sample size. We show that under certain conditions, the proposed statistic is asymptotically normal under the null hypothesis and consistent under the alternative hypothesis. The simulation results suggest that the test based on the proposed statistic has a high power. The proposed inference procedures are used to analyze the change in proportions of diseases detected in physical examination data during a period. Acknowledgments: The authors are grateful to the referees for their insightful comments in revising this paper.

Conflicts of Interest:
The authors declare no conflict of interest.
Proof of Lemma 1.
where κ 0 is defined in condition (ii) of Theorem 2.
Proof of Lemma A1. See the proof of Theorem 1 in Wang et al. [20] Proof of Theorem 1. By Lemma 1 and Lemma A1, it suffices to deduce the distribution of For simplicity, we suppress the subscript B in L kB , i.e., write L kB as L k . By a second-order Taylor expansion, The last inequality follows from the fact that Let P kj = N 0k N 1k N (q 0kj −q 1kj ) 2 /q j , and P k = ∑ j P kj . Then from the above, we have L k = P k + o p (1). Rewriteq 0k andq 1k as 1  Hence, Theorem 1 follows from L kB = P kB + o p (1).
Hence, with probability one, G m,Â > e m . By the condition that e m > cm as (m, K) → ∞, where c > 1 6 , we obtain that Therefore, the power converges to 1 as (m, K) → ∞. Now, assume that δ j ' s satisfy the condition (iv) in Theorem 2. By Theorem 2.2 in [36], for δ j ' > 0 for some j ' ∈ B, we have 2 K ∑ K k=1 N 0k N 1k N MI kB converges to a nonzero limit with probability one, which suggests that the power converges to 1 as (m, K) → ∞.